# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources and connections, and is used by platform APIs.
from project_lib import Project
project = Project(project_id='...', project_access_token='...')
Using the IBM Debater® Thematic Clustering of Sentences dataset, you will explore the data, then use it to create a model that dynamically groups sentences by their main topics and themes. This could be used in an application that collects customer feedback, to help organize the comments automatically.
In this first notebook, you will load, explore, clean and visualize the data. You will then save the cleaned dataset to the Watson Studio project as a data asset to be loaded in Part 2 - Model Development to evaluate a K-Means clustering model.
The dataset contains 692 articles from Wikipedia, where the number of sections (clusters) in each article ranges from 5 to 12, and the number of sentences per article ranges from 17 to 1614.
Before you run this notebook, complete the following steps:
When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources and connections, and is used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context
If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:
Click **More -> Insert project token** in the top-right menu section. This should insert a cell at the top of this notebook similar to the example given above.
If an error is displayed indicating that no project token is defined, follow these instructions.
Run the newly inserted cell before proceeding with the rest of the notebook.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
This notebook uses one file from the IBM Debater® Thematic Clustering of Sentences dataset, named `dataset.csv`. The function below sets the path for the data, then loads and reads the dataset that is already imported into the Watson Studio project as a data asset.
# Define get data file function
def get_file_handle(fname):
    # Project data path for the raw data file
    data_path = project.get_file(fname)
    data_path.seek(0)
    return data_path
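`project.get_file` returns a file-like object, so `pandas` can read it directly. The same pattern can be sketched without Watson Studio by substituting an in-memory handle (illustration only; `project_lib` is not needed here):

```python
import io

import pandas as pd

# Simulate the file-like handle that project.get_file would return
handle = io.StringIO("Article Title,Sentence\nMoeller High School,Hello world.")
handle.seek(0)  # rewind, just as get_file_handle does

df = pd.read_csv(handle)
print(df.shape)  # (1, 2)
```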
This file contains 692 articles from Wikipedia, where the number of sections (clusters) in each article ranges from 5 to 12, and the number of sentences per article ranges from 17 to 1614. Each row in the dataset is a `Sentence`, which belongs to a `SectionTitle`, and each `SectionTitle` belongs to an `Article Title`. The column `Article Link` is the original source of the sentence.
# Define filename
DATA_PATH = 'dataset.csv'
# Use pandas to read the data
data_path = get_file_handle(DATA_PATH)
clustering_df = pd.read_csv(data_path)
clustering_df.head()
Article Title | Sentence | SectionTitle | Article Link | |
---|---|---|---|---|
0 | Moeller High School | Moeller's student-run newspaper, The Crusader,... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School |
1 | Moeller High School | In 2008, The Crusader won First Place, the sec... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School |
2 | Moeller High School | The Squire is a student literary journal that ... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School |
3 | Moeller High School | Paul Keels - play-by-play announcer for Ohio S... | Notable alumni | https://en.wikipedia.org/wiki/Moeller_High_School |
4 | Moeller High School | Joe Uecker - Ohio State Senator (R-66) . | Notable alumni | https://en.wikipedia.org/wiki/Moeller_High_School |
In order for this data to be used to evaluate a clustering model, clusters need to be assigned. According to the readme file of the dataset (found in the original dataset zip here), each cluster corresponds to a `SectionTitle`. That is, every sentence with the same section title is in the same cluster. Thus, you can combine the `Article Title` and `SectionTitle` to get a unique group.
Two columns are added to the dataset to more easily show the clusters by giving each cluster a unique label:

- `label` is a unique string
- `label_id` is a unique number

clustering_df['label'] = clustering_df.apply(lambda row: row['Article Title'].strip().replace(" ", "_") + ":" + row['SectionTitle'].strip().replace(" ", "_"), axis=1)
clustering_df['label_id'] = clustering_df.label.astype('category').cat.codes
clustering_df.head()
Article Title | Sentence | SectionTitle | Article Link | label | label_id | |
---|---|---|---|---|---|---|
0 | Moeller High School | Moeller's student-run newspaper, The Crusader,... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:School_publications | 3414 |
1 | Moeller High School | In 2008, The Crusader won First Place, the sec... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:School_publications | 3414 |
2 | Moeller High School | The Squire is a student literary journal that ... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:School_publications | 3414 |
3 | Moeller High School | Paul Keels - play-by-play announcer for Ohio S... | Notable alumni | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:Notable_alumni | 3413 |
4 | Moeller High School | Joe Uecker - Ohio State Senator (R-66) . | Notable alumni | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:Notable_alumni | 3413 |
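The label construction above can be traced on a tiny synthetic frame (the column names match the dataset; the values are made up):

```python
import pandas as pd

demo = pd.DataFrame({
    'Article Title': ['A School', 'A School', 'B Town'],
    'SectionTitle':  ['History', 'Alumni', 'History'],
})

# Same transformation as in the notebook: strip, replace spaces, join with ':'
demo['label'] = demo.apply(
    lambda row: row['Article Title'].strip().replace(" ", "_")
                + ":" + row['SectionTitle'].strip().replace(" ", "_"),
    axis=1)

# cat.codes assigns one integer per unique label (categories sort alphabetically)
demo['label_id'] = demo.label.astype('category').cat.codes
print(demo[['label', 'label_id']])
```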
Create a dictionary mapping the label ID to the label name.
id_to_category = dict(enumerate(clustering_df.label.astype('category').cat.categories))
Looking at the number of sentences that correspond to each cluster (label), you can see that one cluster has a lot more sentences.
# One group has a lot more sentences.
clustering_df.label_id.value_counts()
32      1308
27       164
30       126
4240      94
3013      91
        ...
2695       3
5105       3
979        3
3026       3
1365       3
Name: label_id, Length: 5555, dtype: int64
id_to_category[32]
'1980_Birthday_Honours:United_Kingdom_and_Colonies'
Remove this cluster from the dataset so that the remaining groups are more evenly sized when you test the model in the second notebook. One very large group may not be an accurate representation of real data.
# Remove rows in that top category
top_id = clustering_df.label_id.value_counts().index[0]
df = clustering_df.loc[(clustering_df.label != id_to_category[top_id])]
df.head()
Article Title | Sentence | SectionTitle | Article Link | label | label_id | |
---|---|---|---|---|---|---|
0 | Moeller High School | Moeller's student-run newspaper, The Crusader,... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:School_publications | 3414 |
1 | Moeller High School | In 2008, The Crusader won First Place, the sec... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:School_publications | 3414 |
2 | Moeller High School | The Squire is a student literary journal that ... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:School_publications | 3414 |
3 | Moeller High School | Paul Keels - play-by-play announcer for Ohio S... | Notable alumni | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:Notable_alumni | 3413 |
4 | Moeller High School | Joe Uecker - Ohio State Senator (R-66) . | Notable alumni | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:Notable_alumni | 3413 |
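The removal step reduces to two operations: find the most frequent label via `value_counts()` (which sorts in descending order of frequency), then filter it out. A self-contained sketch with made-up labels:

```python
import pandas as pd

toy = pd.DataFrame({'label': ['big'] * 5 + ['a'] * 2 + ['b'] * 2})

# value_counts() sorts by frequency, so index[0] is the largest cluster
top = toy.label.value_counts().index[0]
filtered = toy.loc[toy.label != top]

print(top, len(filtered))  # big 4
```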
Next, set the features to be `Sentence`, which is all the text data we are interested in. You will be predicting the `label_id` with the model in the second notebook. Below you can see that there are 5554 clusters (1 removed), and on average about 8 sentences in each cluster.
X = df.Sentence
y = df.label_id
print('Total data rows: ', len(X))
print('Unique groups: ', len(y.unique()))
print('Average number of rows per group: ', clustering_df.label_id.value_counts().mean())
Total data rows:  44809
Unique groups:  5554
Average number of rows per group:  8.301890189018902
To test a model, break this dataset into smaller datasets, because in the real world you would likely not have 5000 unique clusters. Split the data so that each set has about 5 clusters: randomly sample 5000 of the 5554 cluster IDs, then split them into 1000 sets. Now there are 1000 sets to test on (`list_of_groups`).
np.random.seed(42) # get reproducible results
number_of_groups = 1000
sampled_categories = np.random.choice(y.unique(), size=5000)
list_of_groups = np.split(sampled_categories, number_of_groups) # 5 categories in each group
# Convert list_of_groups to a DataFrame to save to the project
groups_of_themes = pd.DataFrame(pd.Series(np.array(list_of_groups).tolist()), columns=['group'])
groups_of_themes.head()
group | |
---|---|
0 | [2822, 1492, 2014, 4508, 4393] |
1 | [535, 2896, 3550, 1670, 2837] |
2 | [739, 659, 1015, 1362, 3938] |
3 | [4167, 4753, 1516, 1386, 1705] |
4 | [3029, 3826, 3057, 3969, 5299] |
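The splitting step relies on `np.split`, which requires the array length to be evenly divisible by the number of groups (5000 / 1000 = 5 here), and raises an error otherwise. Note also that `np.random.choice` samples with replacement unless `replace=False` is passed, so the sampled IDs above may contain repeats. A minimal standalone sketch with smaller, made-up numbers:

```python
import numpy as np

np.random.seed(42)  # reproducible, as in the notebook

ids = np.arange(20)  # stand-in for the unique label IDs
sampled = np.random.choice(ids, size=10, replace=False)  # no repeats
groups = np.split(sampled, 2)  # two groups of 5; 10 / 2 divides evenly

print([g.tolist() for g in groups])
```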
Each theme (cluster) has about 8 sentences on average.
df.label_id.value_counts().describe()
count    5554.000000
mean        8.067879
std         8.506450
min         3.000000
25%         4.000000
50%         5.000000
75%         9.000000
max       164.000000
Name: label_id, dtype: float64
You want a relatively uniform distribution of themes, since the test sets should contain roughly equal-sized clusters. The histogram shown below suggests that the distribution is roughly uniform.
ax = df.label_id.hist()
ax.set_xlabel('Theme ID')
ax.set_ylabel('Count')
ax.set_title('Distribution of Themes')
Text(0.5, 1.0, 'Distribution of Themes')
Next, look at how many words are included in each theme label (`label`). On average, the theme labels are about 4 words long, and the longest is 20 words. Fifty percent of the labels are four words or fewer.
df['label'].str.split('_').apply(len).describe()
count    44809.000000
mean         4.396148
std          2.577010
min          1.000000
25%          3.000000
50%          4.000000
75%          6.000000
max         20.000000
Name: label, dtype: float64
ax = df['label'].str.split('_').apply(len).hist()
ax.set_xlabel('Number of Words in Theme Label')
ax.set_ylabel('Count');
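The word count above works because label words were joined with underscores when the labels were built, so splitting on `_` recovers them. One subtlety: the `:` separator between article and section is not split on, so the two tokens around it count as a single word:

```python
import pandas as pd

labels = pd.Series(['Moeller_High_School:Notable_alumni'])

# Splits into ['Moeller', 'High', 'School:Notable', 'alumni'] -> 4 words
n = labels.str.split('_').apply(len).iloc[0]
print(n)  # 4
```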
On average, the sentences have about 21 words. To test the model in the second notebook, you would not want only very short or very long sentences, since those are unlikely in real comments. About 21 words is close to a typical average sentence length. The histogram below also shows that the word counts are skewed to the right (more sentences are shorter rather than longer).
df['Sentence'].str.split().apply(len).describe()
count    44809.000000
mean        21.846995
std          9.967591
min          5.000000
25%         14.000000
50%         20.000000
75%         28.000000
max         50.000000
Name: Sentence, dtype: float64
ax = df['Sentence'].str.split().apply(len).hist()
ax.set_xlabel('Number of Words in Each Sentence')
ax.set_ylabel('Count');
Finally, save the cleaned dataset as a project asset for later reuse. If successful, you should see output like the following:
{'file_name': 'themes.csv',
'message': 'File saved to project storage.',
'bucket_name': 'ibmdebaterthematicclusteringofsen...',
'asset_id': '...'}
and
{'file_name': 'groups_of_themes.csv',
'message': 'File saved to project storage.',
'bucket_name': 'ibmdebaterthematicclusteringofsen...',
'asset_id': '...'}
Note: In order for this step to work, your project token (see the first cell of this notebook) must have the **Editor** role. By default, this will overwrite any existing file.
project.save_data("themes.csv", df.to_csv(index=False, float_format='%g'), overwrite=True)
project.save_data("groups_of_themes.csv", groups_of_themes.to_csv(index=False), overwrite=True)
Continue with the **Part 2 - Model Development** notebook to explore the cleaned dataset.

This notebook was created by the Center for Open-Source Data & AI Technologies.